eXplainable AI
11/10/23
Figure: the actual model response surface, over age and class, for a logistic regression model with splines.
Sounds like a spell from Harry Potter?
Ceteris paribus is a Latin phrase meaning "all other things being equal" or "all else unchanged", see Wikipedia.
It is a function defined for model \(f\), observation \(x\), and variable \(j\) as:
\[\begin{equation} h^{f}_{x,j}(z) = f\left(x_{j|=z}\right), \end{equation}\]
where \(x_{j|=z}\) stands for observation \(x\) with \(j\)-th coordinate replaced by value \(z\).
The Ceteris Paribus profile is a function that describes how the model response would change if the \(j\)-th variable were changed to \(z\) while all other variables are kept fixed at the values specified by \(x\).
In an implementation we cannot check all possible values of \(z\); we have to select a meaningful subset of them. We will come back to this later.
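To make this concrete, here is a minimal sketch of computing a CP profile on a grid of replacement values. The model and grid below are illustrative assumptions, not from any particular library:

```python
import numpy as np

def ceteris_paribus(f, x, j, z_grid):
    """CP profile h^f_{x,j}: evaluate f on copies of x with the
    j-th coordinate replaced by each value z from z_grid."""
    profile = []
    for z in z_grid:
        x_mod = np.array(x, dtype=float)
        x_mod[j] = z                       # x_{j|=z}
        profile.append(f(x_mod))
    return np.array(profile)

# hypothetical linear model f(x) = 2*x_1 + 3*x_2
f = lambda x: 2 * x[0] + 3 * x[1]
z_grid = np.linspace(0, 1, 5)              # a small, evenly spaced set of z's
cp = ceteris_paribus(f, x=[0.5, 0.5], j=0, z_grid=z_grid)
# for this model the profile is linear in z: 2*z + 1.5
```

In practice the grid is often taken as quantiles of the observed values of \(x_j\), so that the profile is not evaluated far outside the data.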
Note that CP profiles are also commonly referred to as Individual Conditional Expectation (ICE) profiles. This name is common but may be misleading if the model does not predict an expected value.
The Partial Dependence profile averages the model response over the marginal distribution of the remaining variables \(X_{-j}\):
\[ g^{PD}_{j}(z) = E_{X_{-j}} f(X_{j|=z}) . \]
It is estimated by averaging CP profiles over the observed sample:
\[ \hat g^{PD}_{j}(z) = \frac{1}{n} \sum_{i=1}^{n} f(x^i_{j|=z}). \]
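A sketch of this estimator in NumPy, with an illustrative linear model and data (both are assumptions for the example, not part of any XAI library):

```python
import numpy as np

def partial_dependence(f, X, j, z_grid):
    """Estimate g^PD_j(z) by averaging CP profiles over the sample X."""
    profile = []
    for z in z_grid:
        Xz = X.copy()
        Xz[:, j] = z                       # replace the j-th column by z
        profile.append(np.mean([f(row) for row in Xz]))
    return np.array(profile)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))       # independent features
f = lambda x: 1 + 2 * x[0] + 3 * x[1]      # hypothetical linear model
z_grid = np.linspace(0, 1, 5)
pd_hat = partial_dependence(f, X, j=0, z_grid=z_grid)
# for a linear model the PD profile has slope beta_1 = 2
```

Note the cost: each grid point requires \(n\) model evaluations, so the estimator is \(O(n \cdot |z\_grid|)\) predictions.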
Let’s consider a simple linear regression model
\[ f(x) = \hat \mu + \hat \beta_1 x_1 + \hat \beta_2 x_2. \] Then we have
\[ g^{PD}_{1}(z) = E_{X_2} [\hat \mu + \hat \beta_1 z + \hat \beta_2 x_2] = \] \[ \hat \mu + \hat \beta_1 z + \hat \beta_2 E_{X_2} [x_2] = \] \[ \hat \beta_1 z + c \]
What is the problem? The expectation is taken over the marginal distribution of the remaining variables, so for correlated variables the model is averaged over unrealistic combinations of feature values.
Figure: marginal distribution of \(X_2\) on the left and conditional distribution of \(X_2|x_1=0.4\) on the right.
The Marginal profile instead averages over the conditional distribution of \(X_{-j}\) given \(x_j = z\):
\[ g^{MP}_{j}(z) = E_{X_{-j}|x_j=z} f(x_{j|=z}) . \]
\[ \hat g^{MP}_{j}(z) = \frac{1}{|N_j(z)|} \sum_{i \in N_j(z)} f(x^i_{j|=z}), \]
where \(N_j(z)\) is the set of observations whose \(j\)-th coordinate lies close to \(z\).
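A sketch of this estimator, taking the neighbourhood of \(z\) to be all observations whose \(j\)-th coordinate lies within a fixed window around \(z\) (the window width, model, and data are illustrative assumptions):

```python
import numpy as np

def marginal_profile(f, X, j, z_grid, width=0.1):
    """Estimate g^MP_j(z): average f(x_{j|=z}) only over the
    neighbours {i : |x_ij - z| <= width}."""
    profile = []
    for z in z_grid:
        neighbours = X[np.abs(X[:, j] - z) <= width].copy()
        neighbours[:, j] = z               # x_{j|=z} for each neighbour
        profile.append(np.mean([f(row) for row in neighbours]))
    return np.array(profile)

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 1, 500)
X = np.column_stack([x1, x1])              # perfectly correlated features
f = lambda x: 1 + 2 * x[0] + 3 * x[1]      # hypothetical linear model
z_grid = np.linspace(0.1, 0.9, 5)
mp_hat = marginal_profile(f, X, j=0, z_grid=z_grid)
# with perfect correlation the slope is approximately beta_1 + beta_2 = 5
```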
Let’s consider a simple linear regression model
\[ f(x) = \hat \mu + \hat \beta_1 x_1 + \hat \beta_2 x_2, \]
with \(X_1 \sim \mathcal U[0,1]\), while \(x_2=x_1\) (perfect correlation).
Then we have
\[ g^{MP}_{1}(z) = E_{X_2|x_1=z} [\hat \mu + \hat \beta_1 z + \hat \beta_2 x_2] = \] \[ \hat \mu + \hat \beta_1 z + \hat \beta_2 E_{X_2|x_1=z} [x_2] = \] \[ (\hat \beta_1 + \hat \beta_2) z + c \]
We have solved one problem, but two new problems have emerged.
As we saw in the previous example, marginal profiles carry the cumulative effect of all correlated variables. But is this what we wanted?
No. We want to take correlations into account, but distil the individual contribution of the variable \(x_j\). For this we will use Accumulated Local Effects.
\[ g^{AL}_{j}(z) = \int_{z_0}^z \left[E_{X_{-j}|x_j=v} \frac{\partial f(x)}{\partial x_j} \right] dv . \]

- As before, estimating the conditional distribution is difficult. We can deal with it by using a similar trick with \(k\) segments of the variable \(x_j\).
- Let \(k_j(x)\) denote the interval in which observation \(x\) is located with respect to variable \(j\).
- We accumulate local model changes on the intervals \([z^{k-1}_j, z^k_j]\).
\[ \hat g^{AL}_{j}(z) = \sum_{k=1}^{k_j(x)} \frac{1}{|N_j(k)|} \sum_{i \in N_j(k)} [f(x^i_{j|=z^k_j}) - f(x^i_{j|=z^{k-1}_j})] + c. \]
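A sketch of this estimator with quantile-based segments (the bin count, model, and data are illustrative assumptions; the constant \(c\) is taken as 0, i.e. the profile is pinned to 0 at the left edge):

```python
import numpy as np

def accumulated_local_effects(f, X, j, n_bins=10):
    """Estimate g^AL_j at the segment edges z^0_j < ... < z^K_j.
    In each segment, average f(x_{j|=z^k}) - f(x_{j|=z^{k-1}}) over the
    observations N_j(k) falling into it, then accumulate the means."""
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
    ale = [0.0]
    for k in range(1, n_bins + 1):
        in_bin = (X[:, j] > edges[k - 1]) & (X[:, j] <= edges[k])
        if k == 1:
            in_bin |= X[:, j] == edges[0]  # include the left boundary point
        Xb = X[in_bin]
        hi, lo = Xb.copy(), Xb.copy()
        hi[:, j], lo[:, j] = edges[k], edges[k - 1]
        local = np.mean([f(r) for r in hi]) - np.mean([f(r) for r in lo])
        ale.append(ale[-1] + local)
    return edges, np.array(ale)

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 1, 200)
X = np.column_stack([x1, x1])              # perfectly correlated features
f = lambda x: 1 + 2 * x[0] + 3 * x[1]      # hypothetical linear model
edges, ale_hat = accumulated_local_effects(f, X, j=0)
# only the individual effect of x_1 accumulates: slope beta_1 = 2
```

Because each local change replaces only \(x_j\) within a narrow segment, the model is never evaluated far from the data, yet the effect of the correlated \(x_2\) is not mixed in.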
Let’s consider a simple linear regression model
\[ f(x) = \hat \mu + \hat \beta_1 x_1 + \hat \beta_2 x_2, \]
with \(X_1 \sim \mathcal U[0,1]\), while \(x_2=x_1\) (perfect correlation).
Then we have
\[ g^{AL}_{1}(z) = \int_0^z E_{X_2|x_1=v} \frac{\partial ( \hat \mu + \hat \beta_1 x_1 + \hat \beta_2 x_2)}{\partial x_1} dv + c= \] \[ \int_0^z E_{X_2|x_1=v} \hat \beta_1 dv +c = \] \[ \int_0^z \hat \beta_1 dv +c = \] \[ \hat \beta_1 z + c \]
Let’s consider the following model
\[ f(x_1, x_2) = (x_1 +1) x_2 \]
where \(X_1 \sim \mathcal U[-1,1]\), while \(x_2=x_1\) (perfect correlation).
\[ h^{CP}_{1}(z) = (z+1)x_2 \]
\[ g^{PD}_{1}(z) = E_{X_2} (z+1)x_2 = 0 \]
\[ g^{MP}_{1}(z) = E_{X_2|x_1=z} (z+1)x_2 = z(z+1) \]
\[ g^{AL}_{1}(z) = \int_{-1}^z E_{X_2|x_1=v} \frac{\partial (x_1+1)x_2}{\partial x_1} dv = \int_{-1}^z E_{X_2|x_1=v} x_2 dv = \] \[ \int_{-1}^z v dv = (z^2 - 1)/2 \]
Let’s explain the model with the following sample
| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| \(x_1\) | -1 | -0.71 | -0.43 | -0.14 | 0.14 | 0.43 | 0.71 | 1 |
| \(x_2\) | -1 | -0.71 | -0.43 | -0.14 | 0.14 | 0.43 | 0.71 | 1 |
| \(y\) | 0 | -0.2059 | -0.2451 | -0.1204 | 0.1596 | 0.6149 | 1.2141 | 2 |
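On this sample the three profiles can be computed directly; a sketch, under the assumption that with perfect correlation the conditional neighbourhood of \(z\) contains exactly the matching sample point:

```python
import numpy as np

# the sample from the table above; x_2 = x_1
x = np.array([-1, -0.71, -0.43, -0.14, 0.14, 0.43, 0.71, 1.0])
f = lambda x1, x2: (x1 + 1) * x2

# PD: average over the whole sample of x_2; flat at 0 since mean(x) = 0
pd = np.array([np.mean((z + 1) * x) for z in x])

# MP: condition on x_1 = z; perfect correlation forces x_2 = z
mp = np.array([f(z, z) for z in x])

# ALE: accumulate mean local changes over consecutive segments [x_{k-1}, x_k];
# each segment contains the single observation with x_2 = x_k
local = np.diff(x) * x[1:]                 # f(x_k, x_k) - f(x_{k-1}, x_k)
ale = np.concatenate([[0.0], np.cumsum(local)])
# ale roughly follows the theoretical (z^2 - 1)/2, up to the coarse grid
```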
The figure below summarizes the differences between the PD, MP, and ALE profiles.

eXplainable AI – Introduction – MIM UW – 2023/24